Statistical Common Author Networks (SCAN) 
A new method for visualizing the relatedness of scientific fields is developed that is based on measuring the 
overlap of researchers between fields. It is found that closely related fields have a high propensity to share 
a larger number of common authors. A methodology for comparing fields of vastly different sizes and for 
handling name homonymy is constructed, allowing for the robust deployment of this method on real data sets. A 
statistical analysis of the probability distributions of the common author overlap that accounts for noise is carried 
out along with the production of network maps with weighted links proportional to the overlap strength. This 
is demonstrated on two case studies, complexity science and neutrino physics, where the level of relatedness of 
fields within each area is expected to vary greatly. It is found that the results returned by this method closely 
match the intuitive expectation that the broad, multidisciplinary area of complexity science possesses fields that 
are weakly related to each other while the much narrower area of neutrino physics shows very strongly related 
fields. 



I. INTRODUCTION 

Understanding the growth and evolution of academic research
fields [1, 2] is important to assessing the health and influence
of scientific areas and can provide potentially important
predictive capability in assessing technologies that may
emerge from fundamental and applied research. A conse- 
quence of the large and growing number of highly specialized 
research areas is that identifying the productive intersection 
of these yD can no longer be done manually. However, the 
ready availability of computing power, the large frequency of 
published work and the relatively high data integrity of bibli- 
ographic databases provides the elements necessary for auto- 
mated screening and visualization of these areas. 

The visualization of research areas is an active area in bib- 
liometric studies |0, 0], largely using clustering of individual 
units to describe the relatedness of research areas. The primary
metric conferring this relatedness has historically been
the citation frequency Q, with the individual unit of mea- 
sure being an instance of one publication citing another. Us- 
ing publications as nodes, a very complicated unweighted di- 
rected network could be formed relating publications together. 
Since this is visually confusing, the practice of re-assigning 
these nodes as either authors or journals[6] is preferable, re- 
sulting in a weighted network map where the clustering ob- 
served in these networks broadly reflects the topical areas of 
study. Intuitively, these methods work well at understanding 
the relatedness of topical areas because authors tend to cite re- 
search in the area of their study more frequently and journals 
tend to publish work that caters to a specific, topically focused 
scientific community. Of value in these visualizations are the 
areas of study that lie between topical clusters that represent 
interdisciplinary research, which can often give rise to emerg- 
ing scientific areas. 

While the methods described above use citation as the fun- 
damental unit of measure, we offer an alternative approach 
by showing how counting the occurrences of the same author 
working in multiple fields can provide the necessary linking
to relate these multiple fields to each other. Intuitively, this
approach is motivated by the observation that scientists work- 
ing in one area of study will work in related areas of study 
more frequently than in unrelated areas, and so we expect a 
stronger connection between closely related areas. The ap- 
proach we carried out produces an undirected, weighted net- 
work map that differs from the practice described above in the 
following ways: (1) the nodes themselves are the topical areas 
of study, (2) the weight of the link connecting one node to the 
next is proportional to the number of authors shared between 
those topical areas, and (3) the clustering observed will define 
a major topical area composed of closely related topics. One 
of the values of using common authors over citations is that 
the links observed are much stronger since they require au- 
thors to develop deep expertise in these areas in order to pub- 
lish successfully in them, as opposed to a simple understand- 
ing of the work executed, which is the minimum requirement 
to cite another's work in the case of citation patterns. In this 
paper, we develop the procedures for establishing this link, in 
particular correcting for name homonymy in a statistical way. 

For our case studies to demonstrate this methodology we 
selected two extremely different areas: the complexity sci- 
ences and neutrino physics. While the latter is a traditional, 
narrow field of study that is deeply rooted in physics, the 
former is a multidisciplinary field that intersects with many 
other scientific areas, drawing upon the talents of many dif- 
ferent types of scientists. For this reason we intuitively an- 
ticipate a stronger degree of relatedness in neutrino physics 
than complexity science, and show that the method we devel- 
oped confirms that intuition. Last, we point out that while a 
significant number of methods in bibliometrics focus on relationships
between authors and papers (i.e., co-authorship [7]
or citation patterns) that elucidate the structure and pattern 
within these fields, this approach focuses on the relationships 
between these fields. 



II. METHODOLOGY 

The publications used to generate the common author
graphs were drawn from the Institution of Engineering and
Technology's Inspec publication database as accessed through
the Thomson Reuters ISI Web of Knowledge v5.5 index.
Inspec possesses broad coverage over the fields of physics, 
mathematics and computer science. Once the Inspec database 
was selected through the Web of Knowledge search interface 
the Boolean keyword or series of keywords best representing 
the sub-field under investigation were entered into the Inspec 
search field. The search was performed over the years 1969-2012,
the longest time span available in the database; however,
the vast majority of searches returned results spanning shorter
periods. Each keyword search typically returned $10^1$ to $10^5$
publications. A custom Python script was written and used to
pre-process the database by sub-fields to extract a list of au- 
thors, where the last name and initials were stored, and repeti- 
tions were removed. This produced a list of unique authors for 
every keyword search. These lists were then compared with 
each other to determine the number of authors the lists had in 
common. A symmetric matrix of pairwise comparisons was 
generated in this way using fast search algorithms in Python. 
Typical computing times were on the order of a few minutes
for the generation of individual topic lists, while the overlap
between topics required on the order of several hundred
searches over the sub-fields and took approximately half an
hour, using server-class hardware.
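
The pre-processing and pairwise comparison described above can be sketched as follows. This is a minimal illustration rather than the script used in the study: the input layout (each publication as a list of `(last_name, given_names)` tuples) and the function names `author_key`, `unique_authors` and `overlap_matrix` are assumptions made for the example.

```python
from itertools import combinations


def author_key(last_name, given_names):
    """Reduce an author to 'lastname initials', the matching key described above."""
    initials = "".join(part[0] for part in given_names.replace(".", " ").split())
    return f"{last_name.strip().lower()} {initials.lower()}".strip()


def unique_authors(records):
    """Collect the set of unique author keys for one keyword search.

    records: iterable of publications, each a list of (last_name, given_names)
    tuples (a hypothetical input layout chosen for this sketch).
    """
    return {author_key(last, given) for paper in records for last, given in paper}


def overlap_matrix(author_sets):
    """Symmetric matrix (nested dict) of common-author counts between sub-fields.

    The diagonal holds the number of unique authors in each sub-field.
    """
    names = sorted(author_sets)
    counts = {a: {b: 0 for b in names} for a in names}
    for a in names:
        counts[a][a] = len(author_sets[a])
    for a, b in combinations(names, 2):
        shared = len(author_sets[a] & author_sets[b])  # set intersection
        counts[a][b] = counts[b][a] = shared
    return counts
```

Python sets are hash-based, so each pairwise intersection runs in time roughly proportional to the smaller list, which is one plausible reading of the "fast search algorithms" mentioned above.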



III. DISCUSSION 

As a first approximation to quantifying the link between any 
two fields of study one can postulate the number of authors 
common to both fields. Unfortunately this naive approach suffers
from two deficiencies that preclude its use as a measure
of overlap: field size dependence and noise. Intuitively, it
can be reasoned that the number of common authors depends
to first order on (1) the number of authors in
each topic which varies by several orders of magnitude based 
on field size, and (2) the probability of false positives that 
arise from matching two authors that are different people with 
the same last name and initials. These occurrences, though 
rare, cannot be eliminated easily and are globally present and 
mostly uniform. For these reasons they will be referred to 
as noise arising from name homonymy, which is a persistent
problem in bibliometrics [8, 9]. Below, we develop a treatment
for both of these effects. 

First, we develop a treatment to deal with the large variation 
of field sizes that will affect the number of common authors 
in the pairwise matching. Let us consider a pool of names 
and from it extract two lists of names, N and M, containing 
n and m elements respectively with the restriction that n < m 
and that the names be unique within the lists, but not nec- 
essarily between each other. Let us start by comparing one 
element of N to one of the elements in the list of M. There 
are two outcomes: the element either matches an entry in the 
list with probability p, or it does not (with probability $1 - p$).
Since there are m elements in M, the probability of finding no
match between the first element in N and the entire list M
is $(1 - p)^m$. However, we are interested not in the case
of no matches but in its complement, which can be
approximated by $1 - (1 - p)^m$, the probability that a single
element in N will match an element in list M. We now
develop an expression for comparing the entirety of
both lists to each other. As a first approximation we can multiply
the single-element matching probability by
the number of elements, n, to produce the expression in Equation 1,
where $\langle k \rangle$ is the expected number of matches between
the lists.

$$\langle k \rangle = n\left(1 - (1 - p)^m\right) \qquad (1)$$

For our purposes, the unknown variable is p, which allows
us to re-arrange Equation 1 to

$$p = 1 - \left(1 - \frac{\langle k \rangle}{n}\right)^{1/m} \qquad (2)$$
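
As a small illustration, Equations 1 and 2 can be written as a pair of functions that invert each other. The function names are chosen for this sketch only. Using `log1p` and `expm1` is a numerical-stability choice for very small p, not something stated in the text.

```python
import math


def expected_matches(n, m, p):
    """Equation 1: <k> = n * (1 - (1 - p)**m)."""
    return -n * math.expm1(m * math.log1p(-p))


def matching_probability(k_mean, n, m):
    """Equation 2: p = 1 - (1 - <k>/n)**(1/m)."""
    if not 0 <= k_mean < n:
        raise ValueError("k_mean must satisfy 0 <= k_mean < n")
    return -math.expm1(math.log1p(-k_mean / n) / m)


# Round trip with illustrative (not measured) values.
k = expected_matches(n=500, m=8000, p=2e-5)
p_back = matching_probability(k, n=500, m=8000)
```

Here `p_back` should agree with the input value of p to within floating-point precision, since Equation 2 is the exact algebraic inverse of Equation 1.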

The above equation is a convenient expression but is still
an approximation: every time there is a match, the matched
element of M is removed from further comparison, because each
element in M is unique and so cannot be matched again.
Since the pool of candidates shrinks, the probabilities
must be adjusted after every match. In order
to validate our use of Equation 2, we carried out a Monte
Carlo simulation of the exact solution over the range of n and
m within the lists used in this study by generating matches
between lists for different values of p, computing an average
number of matches $\langle k \rangle$, and trying to recover the initial value
of p by using Equation 2. The result is that for sizes within the
ranges used, there was less than 5% error between the Monte
Carlo result and the analytical expression of Equation 2, confirming
our use of the latter as a valid approximation.
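
A check of this kind might look like the sketch below. It rests on one assumed reading of the "exact" process: each name in N is compared against the names of M that are still unmatched, and a matched name is removed from M. The list sizes, the value of p, the trial count and the seed are illustrative choices, not the values used in the study.

```python
import math
import random


def simulate_mean_matches(n, m, p, trials, rng):
    """Average number of matches when matched names are removed from M."""
    total = 0
    for _ in range(trials):
        remaining = m  # names in M not yet matched
        matches = 0
        for _ in range(n):
            # Chance that this name hits at least one of the remaining names.
            if rng.random() < 1.0 - (1.0 - p) ** remaining:
                matches += 1
                remaining -= 1
        total += matches
    return total / trials


def recover_p(k_mean, n, m):
    """Equation 2 applied to the simulated average number of matches."""
    return -math.expm1(math.log1p(-k_mean / n) / m)


rng = random.Random(0)  # fixed seed so the run is repeatable
n, m, p_true = 200, 2000, 1e-4
k_mean = simulate_mean_matches(n, m, p_true, trials=200, rng=rng)
p_est = recover_p(k_mean, n, m)
relative_error = abs(p_est - p_true) / p_true
```

With these sizes only a small fraction of M is ever removed, so the recovered value is expected to lie close to `p_true`. The gap should widen as the expected number of matches approaches m.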

Now that an expression for the matching probability has 
been developed, it can be used as a measure of the strength 
of the link between various fields of study, which describes 
the overall probability that authors in one field will also pub- 
lish in the paired field. While that is the focus of this study, 
it is first important to characterize the amount of noise arising 
from name homonymy. The statistics arising from the match- 
ing probabilities as calculated from the number of matched 
authors (the pairwise comparison matrix described in the 
methodology) can be used to determine this noise factor. To 
do this, we choose pairs of fields in which we intuitively ex- 
pect to find no true overlap of common authors, implying that 
the overlap found is due solely to name homonymy. Specifi- 
cally, we apply Equation 2 to the pairwise comparison of 25 
fields within neutrino physics to 25 fields within complexity 
science. This produces a matrix of values of matching prob- 
ability that describe the occurrence of name homonymy. We 
plot the histogram of these values in Figure 1, Top. The histogram
shows a tight, approximately Gaussian distribution of
probabilities. In order to validate this
against ground truth, the list of authors identified in this way
was randomly spot-checked by using the search
engine Google along with the affiliations listed in their papers
to find the specific individuals. It was verified that nearly all
of the common authors found in this way corresponded to two
or more distinct individuals, thus lending support to our assertion
that this is a reasonable method to estimate the level of
name homonymy.

Figure 1: Histograms of scientific fields surveyed plotted as a function
of the matching probability between fields of study (p) as calculated
in Equation 2, using bins of the same size for comparison.
Top: Probability values between areas in neutrino physics and complexity
science are representative of name homonymy error. Middle:
Probability values for complexity science. Bottom: Probability values
for neutrino physics. Note that the insets in Top and Middle show
higher resolution (more bins over a smaller range) while the inset of
the Bottom shows a wider range (more bins, but over a much wider
range).
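
The noise estimate described above amounts to applying Equation 2 to every cross-area pair and summarizing the resulting distribution. The sketch below assumes each pair is supplied as a tuple `(common_authors, size_a, size_b)`. The numbers in the example are invented for illustration and are not results from the study.

```python
import math
from statistics import mean, stdev


def matching_probability(k, n, m):
    """Equation 2, with n the smaller and m the larger list size."""
    if not 0 <= k < n:
        raise ValueError("k must satisfy 0 <= k < n")
    return -math.expm1(math.log1p(-k / n) / m)


def homonymy_noise(cross_pairs):
    """Mean and spread of the matching probability over unrelated field pairs.

    cross_pairs: iterable of (common_authors, size_a, size_b), one entry per
    pair of fields drawn from two areas assumed to share no true authors.
    """
    probs = []
    for common, size_a, size_b in cross_pairs:
        n, m = sorted((size_a, size_b))  # enforce n <= m
        probs.append(matching_probability(common, n, m))
    return mean(probs), stdev(probs), probs


# Hypothetical counts, for illustration only.
pairs = [(3, 1000, 20000), (5, 1500, 30000), (2, 800, 25000)]
p0_mean, p0_spread, p0_values = homonymy_noise(pairs)
```

The list `p0_values` is what would be binned to produce a histogram like the top panel of Figure 1, and `p0_mean` plays the role of the average homonymy probability used later as the noise level.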

Similar statistical analyses were then carried out on the ar- 
eas of neutrino physics and complexity science, comparing 
fields within each area inclusively. In Figure 1 Middle, a his- 
togram plotting the matching probability values of the pair- 
wise comparison matrix of complexity science is shown. It 
can be seen that while there is a large number of matching 
probability values that correspond to the peak of the name 
homonymy noise, there are a significant number of matches 
that far exceed these values by as much as five times the noise 
value. Still, the similarity of the peak of the distribution to
the noise peak suggests that this is a very weakly related area of study
where there are very few common authors between fields. This
matches well with our intuition and knowledge of complexity
science, which tends to be strongly interdisciplinary, drawing
scientists working in diverse areas such as sociology, biology,
computer science and economics. Complexity scientists do
not share common skillsets, training, or equipment and there
remains debate on the defining elements and boundaries
of their field.

In Figure 1 Bottom, a similar treatment is carried out for the 
field of neutrino physics. Here we find strong overlap between 
authors as exemplified by a shift in the distribution toward 
much higher matching probability values, as high as 40 times 
the name homonymy noise peak. The peak itself has shifted to
between 4 and 6 times the noise value. This indicates
that the field of neutrino physics is very strongly related, with
a large number of scientists in one field publishing in many
others. This also matches our intuition since we know that
this area of study is very deeply rooted in physics, requiring
very expensive specialized instruments and involving a much smaller,
less diverse physics-oriented community. Physicists studying
neutrinos have very similar skillsets and training and in fact
use not only similar but sometimes the same equipment.

Now we use the statistics gathered from the name homonymy
noise to correct the matching probabilities within areas
of study for the undesirable phenomenon of name
homonymy. Simply, we define the link strength (l) to be the
matching probability (p) between fields within complexity science
or neutrino physics minus the average of the matching
probability ($\langle p_0 \rangle$) of the name homonymy between complexity
science and neutrino physics,

$$l = p - \langle p_0 \rangle \qquad (3)$$

Plots of the fields of complexity science and neutrino
physics are shown in Figure 2, where a higher link strength
is represented by thicker lines connecting
the nodes.
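
One way to turn matching probabilities into drawable links is sketched below. Two choices here are assumptions of the sketch rather than statements from the text: links whose strength is not above zero are dropped, and line widths are scaled to the strongest link within a single map. The names `weighted_edges` and `max_width`, and the example values, are hypothetical.

```python
def link_strength(p, p0_mean):
    """Equation 3: matching probability minus the mean homonymy probability."""
    return p - p0_mean


def weighted_edges(p_by_pair, p0_mean, max_width=8.0):
    """Return (field_a, field_b, strength, line_width) for links above the noise.

    p_by_pair: dict mapping (field_a, field_b) to the matching probability p.
    Widths are normalized to the strongest link in this map only.
    """
    strengths = {pair: link_strength(p, p0_mean) for pair, p in p_by_pair.items()}
    positive = {pair: s for pair, s in strengths.items() if s > 0}
    if not positive:
        return []
    strongest = max(positive.values())
    return [
        (a, b, s, max_width * s / strongest)
        for (a, b), s in sorted(positive.items())
    ]


# Hypothetical probabilities, for illustration only.
edges = weighted_edges(
    {("a", "b"): 5e-7, ("a", "c"): 1e-7, ("b", "c"): 3e-7},
    p0_mean=1.2e-7,
)
```

The resulting tuples can be passed to any graph-drawing tool that accepts per-edge line widths.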

We observe additional intuitive verification when looking 
at the relative link weighting. For example, in complexity science,
very thick weighted lines connect the socially related areas:
social network, social simulation, social systems, social
cybernetics. Additionally, areas where there is little connection
also bear out our expectations. For example, the only
field connected to particle swarm is swarming behavior, as
expected. Fields which are subsets of each other also possess
strongly weighted links as expected. For example, in neutrino
physics there is a very strong link between beta decay and
neutrinoless double beta decay, as the papers (and therefore authors)
returned for the former include those of the latter,
since the keyword of the former is contained in that of the latter.

We have described the development and execution of a
novel method for visualizing scientific fields by looking at
common authors, which is valuable in the study of emerging
interdisciplinary areas. The links of this network are a function
of how easily the methods and training in one field can
contribute to work in a related area. Compared to other knowledge
mapping methods that look at patterns in citations and
collaborations, this method is much more selective, as common
authorship can only occur when an author has the depth of expertise
to allow him or her to publish original work in multiple
fields. A comparison of mapping differences between these
measures is planned for future work. Last, since our development
of an approach to estimate the expected noise arising
from name homonymy and the further statistical treatment of
establishing link weightings resulted in an approximation, future
work is planned to explore this more accurately.

Figure 2: Network structure of complexity science (top) and neutrino
physics (bottom) showing the relatedness of fields of study (nodes) as
determined from the number of authors that are common to each field
of study. The thickness of the lines represents the link weight and
is proportional to the matching probability between fields. Note that
for clarity, the link weights are consistent within the top and bottom
figures but not relative to each other; on a common scale, the lines
in the top graph would be too faint to see.



V. REFERENCES